EQ Challenge Notebook

Content

  1. Data Collection
  2. Clean-Up
  3. Label
  4. Analysis
  5. Data Science/Engineering Tracks a. Model
     1- Visualize
     2- Bonus
    
    b. Pipeline Dependency

0. Data Collection

Import and Install necessary libraries.

1. Clean-up Data

Records that have identical geoinfo and timest as are considered suspicious and should be removed

Remove Duplicate POI with the same location

2. Label

Assign each request to the closest POI

3. Analysis

Calculate Average Distance, Standard Deviation and density of each POI.

4. Data Science/Engineering Tracks

4a. Model

Popularity index (PI) in finance is defined as the ratio of portion sales for a given menu item to total portion sales. In this question, we will consider the menu items as Point of Interests (POI's) and the sales as the request to a given POI. The conventional way to solve PI is as follows: PI(POIn)=Sum(Requests to POIn) / Sum(Requests to all POI's)

As to show some of my skills in Machine Learning and to apply a more accurate technique that can solve PI as data streams to the Dataframe, I will apply a reinforcement Learning algorithm ("The multi-armed bandit model").

The Multi-armed Bandit Model

A multi-armed bandit is a complicated slot machine wherein instead of 1, there are several levers which a gambler can pull, with each lever giving a different return. The probability distribution for the reward corresponding to each lever is different and is unknown to the gambler. The task is to identify which lever to pull in order to get a maximum reward after a given set of trials. This problem statement is like a single step Markov decision process, Each arm chosen is equivalent to an action, which then leads to an immediate reward.

To translate this problem to our case, bandits arms are considered POI's and gamblers are considered requests to a given POI.

For the sake of context, this assumption is based on my understanding of the challenge problem. Furthermore, we considering POI Density or simply Number of Requests/ Area as the main factor of popularity instead of Number of Requests which will lead to a different ranking index that is affected by POI coverage area.

One way to implement Multi-armed Bandit is by Upper Confidence Bound algorithm. Upper Confidence Bound (UCB) is the most widely used solution method for multi-armed bandit problems. This algorithm is based on the principle of optimism in the face of uncertainty. Meaning that, the more uncertain we are about an arm, the more important it becomes to explore that arm.

For example, the following figure shows the Distribution of action-value functions (requests) for 3 different POI's POI1, POI2, and POI3 after several requests. This distribution shows that the action value for POI1 has the highest variance and hence maximum uncertainty. UCB says that we should choose the POI POI1 and receive a reward making us less uncertain about its action value. For the next request/timestep, if we still are very uncertain about POI11, we will choose it again until the uncertainty is reduced below a threshold.

im.jpeg

UBC Model:

  1. Play each of the K actions once, giving initial values for mean rewards corresponding to each action POIt
  2. For each round t = K: Let Nt(POI) represent the number of times POI was requested until time t select the POI that maximise the following expression:

UBC_math.jpeg

Each time a is selected, the uncertainty is presumably reduced: Nt(a) increments, and, as it appears in the denominator, the uncertainty term decreases. where Q(POI)is a random exploration factor.

Let's first, remove the outlier requests (away from Inter Quartile Range IQR) which are too far away from the POI. IQR is a measure of statistical dispersion, which is equal to the difference between the 75th percentile and the 25th percentile. IQR Can be used to detect request outliers in a few easy and straightforward steps:

  1. Calculate the 1st quartile Q1.
  2. Calculate the 3rd quartile Q3.
  3. Calculate IQR=Q3−Q1.
  4. Calculate the Requests Range:
    • Lower bound: Q1−1.5∗IQR
    • Upper bound: Q3+1.5∗IQR
  5. Remove any points outside the Requests Range as suspected outliers.

Reinforcement Learning : Solving the Multi-Armed Bandit Problem

Implementing UBC

Bonus

Show POI and request Distribution on Map

To have some insights about the geographical distribution of requests around POI's, a geographical map is implemented.

Hypotheses

Based on the geographical distribution of the POI and the Distance and Density analysis one can draw several Hypotheses.

  1. POI's location are selected to cover most most population.
  2. POI's location is ment to be away from each other.
  3. POI's provinance are chosen to be at low tax cities.
  4. POI's Location where chosen to be in Canada.

Five Steps for each Hypothesis Testing:

  1. Specify the Null Hypothesis.
  2. Specify the Alternative Hypothesis.
  3. Set the Significance Level (a)
  4. Calculate the Test Statistic and Corresponding P-Value.
  5. Drawing a Conclusion.

Note: Solution Attached

4b. Pipeline Dependency

In order to maintain a flexible update and scalable graph, we can construct a Data Pipeline using Spark Data Streaming Package and GraphX. GraphX is a new component in Spark for graphs and graph-parallel computation. At a high level, GraphX extends the Spark RDD by introducing a new Graph abstraction: a directed multigraph with properties attached to each vertex and edge. To support graph computation, GraphX exposes a set of fundamental operators (e.g., subgraph, joinVertices, and aggregateMessages) as well as an optimized variant of the Pregel API. Unfortunately, GraphX is not yet supported in PySpark. However, an alternative package called GraphFrames can handle most of the GraphX functionality.

In this solution my approach will be as follows:

  1. Create a stream Reader (spark.readStream) using Scala Structure Stream Reader, so we can Update tasks and Relation immediately in the streaming Source folder.
  2. Input Streamed data to appropriate Queries to import data as required by GraphFrames.
  3. Find a Reverse path from destination to source taking into consideration all path dependencies: to do that, we can perform one of the following approaches:
    • a. perform a Breadth-First Search (BFS) from destination to source and print all visited neighbors along the path. (However, GraphFrames BFS currently don't allow to print visited nodes.)
    • b. Use GraphX Pregel (parallel graph google) API which allows us to send the message in the graph from one node and aggregate it from another edge where we can apply any function on each visited node. In our problem, we can send broadcast messages from "destination task" and collect at "source task" then collect the message visited node id's.

Alternative partial solution Using Graphframes BFS